A Jupyter notebook showing an unsupervised learning problem description, EDA procedure, analysis (model building and training), result, and discussion/conclusion.
The goal is to suggest to the CEO of HELP International which countries are in the most direct need of aid. The task is to categorize the countries using socio-economic and health factors that determine a country's overall development. For this problem, I will use the K-means algorithm to divide the data into categories.
The dataset comes from Kaggle (https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data). There are 167 rows and 10 columns.
# Basic libraries
import pandas as pd
import numpy as np
import math
import statistics
import time
from collections import Counter
# Plots
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# Data processing, metrics and modeling
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from yellowbrick.cluster import KElbowVisualizer
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_score as sil_score
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Read the data
data_ini = pd.read_csv('./data/Country-data.csv')
data_ini.info()  # info() prints its report and returns None, so it is not passed to display()
display(data_ini.head(), data_ini.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| count | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| mean | 38.270060 | 41.108976 | 6.815689 | 46.890215 | 17144.688623 | 7.781832 | 70.555689 | 2.947964 | 12964.155689 |
| std | 40.328931 | 27.412010 | 2.746837 | 24.209589 | 19278.067698 | 10.570704 | 8.893172 | 1.513848 | 18328.704809 |
| min | 2.600000 | 0.109000 | 1.810000 | 0.065900 | 609.000000 | -4.210000 | 32.100000 | 1.150000 | 231.000000 |
| 25% | 8.250000 | 23.800000 | 4.920000 | 30.200000 | 3355.000000 | 1.810000 | 65.300000 | 1.795000 | 1330.000000 |
| 50% | 19.300000 | 35.000000 | 6.320000 | 43.300000 | 9960.000000 | 5.390000 | 73.100000 | 2.410000 | 4660.000000 |
| 75% | 62.100000 | 51.350000 | 8.600000 | 58.750000 | 22800.000000 | 10.750000 | 76.800000 | 3.880000 | 14050.000000 |
| max | 208.000000 | 200.000000 | 17.900000 | 174.000000 | 125000.000000 | 104.000000 | 82.800000 | 7.490000 | 105000.000000 |
Checking whether any of the features contain null values.
# Checking for null values
data_ini.isnull().any()
country       False
child_mort    False
exports       False
health        False
imports       False
income        False
inflation     False
life_expec    False
total_fer     False
gdpp          False
dtype: bool
Dropping the country column because it is an identifier and is not needed for the modeling.
data = data_ini.drop(columns = 'country')
data.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| child_mort | 167.0 | 38.270060 | 40.328931 | 2.6000 | 8.250 | 19.30 | 62.10 | 208.00 |
| exports | 167.0 | 41.108976 | 27.412010 | 0.1090 | 23.800 | 35.00 | 51.35 | 200.00 |
| health | 167.0 | 6.815689 | 2.746837 | 1.8100 | 4.920 | 6.32 | 8.60 | 17.90 |
| imports | 167.0 | 46.890215 | 24.209589 | 0.0659 | 30.200 | 43.30 | 58.75 | 174.00 |
| income | 167.0 | 17144.688623 | 19278.067698 | 609.0000 | 3355.000 | 9960.00 | 22800.00 | 125000.00 |
| inflation | 167.0 | 7.781832 | 10.570704 | -4.2100 | 1.810 | 5.39 | 10.75 | 104.00 |
| life_expec | 167.0 | 70.555689 | 8.893172 | 32.1000 | 65.300 | 73.10 | 76.80 | 82.80 |
| total_fer | 167.0 | 2.947964 | 1.513848 | 1.1500 | 1.795 | 2.41 | 3.88 | 7.49 |
| gdpp | 167.0 | 12964.155689 | 18328.704809 | 231.0000 | 1330.000 | 4660.00 | 14050.00 | 105000.00 |
Identifying which variables are most strongly correlated with each other.
plt.figure(figsize = (12, 7))
sns.heatmap(data.corr(), annot = True, cmap="BuPu")
#plt.savefig('seismic',dpi=1000)
plt.show()
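The heatmap is easier to read next to a ranked list of the strongest pairs. The snippet below is a minimal sketch of one way to get that list from a correlation matrix. The toy frame and its values are made up for illustration; in the notebook, `data.corr()` would take the place of `toy.corr()`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for `data`
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 8.1, 9.8],   # rises almost linearly with x
    'z': [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as x rises
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.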
Plotting the pairwise relationships between the variables, with a kernel density estimate of each variable on the diagonal.
sns.pairplot(data,diag_kind='kde')
scale = StandardScaler()
sData = pd.DataFrame(scale.fit_transform(data), columns = data.columns) # Scaled Data
sData.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| child_mort | 167.0 | -3.722904e-17 | 1.003008 | -0.887138 | -0.746619 | -0.471798 | 0.592667 | 4.221297 |
| exports | 167.0 | 2.127373e-16 | 1.003008 | -1.500192 | -0.633337 | -0.223528 | 0.374720 | 5.813835 |
| health | 167.0 | 5.504579e-16 | 1.003008 | -1.827827 | -0.692211 | -0.181001 | 0.651541 | 4.047436 |
| imports | 167.0 | 2.765585e-16 | 1.003008 | -1.939940 | -0.691479 | -0.148743 | 0.491353 | 5.266181 |
| income | 167.0 | -7.977650e-17 | 1.003008 | -0.860326 | -0.717456 | -0.373808 | 0.294237 | 5.611542 |
| inflation | 167.0 | -1.063687e-17 | 1.003008 | -1.137852 | -0.566641 | -0.226950 | 0.281636 | 9.129718 |
| life_expec | 167.0 | 3.696311e-16 | 1.003008 | -4.337186 | -0.592758 | 0.286958 | 0.704258 | 1.380962 |
| total_fer | 167.0 | 3.044803e-16 | 1.003008 | -1.191250 | -0.763902 | -0.356431 | 0.617525 | 3.009349 |
| gdpp | 167.0 | 5.850277e-17 | 1.003008 | -0.696801 | -0.636660 | -0.454431 | 0.059421 | 5.036507 |
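The std column above shows 1.003008 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces the effect with NumPy only. The feature values are randomly generated for illustration; only the row count of 167 comes from the dataset.

```python
import numpy as np

n = 167  # number of rows in the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=n)  # made-up feature values

# Standardize the way StandardScaler does: subtract the mean,
# then divide by the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)      # 1 up to floating-point error
sample_std = z.std(ddof=1)   # the estimator that describe() uses
print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```

The means in the table are values like 1e-16 for the same kind of reason: they are zero up to floating-point rounding.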
plt.figure(figsize = (12, 7))
km = KMeans(random_state=42)
visualizer = KElbowVisualizer(km, k=(2,10))
visualizer.fit(sData)
visualizer.show()
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
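To make the quantity on the y-axis concrete: as far as I understand yellowbrick's default metric, the distortion score is the sum of squared distances from each point to its assigned cluster center, which scikit-learn stores as `inertia_`. The sketch below computes it by hand for several values of k. The data is synthetic (three made-up blobs), not the country data, and `n_init=10` is my own choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (made up for illustration): three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

distortions = {}   # distortion computed by hand
inertias = {}      # value reported by scikit-learn
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # Center assigned to each point, then the sum of squared distances to it
    assigned = model.cluster_centers_[model.labels_]
    distortions[k] = float(((X_demo - assigned) ** 2).sum())
    inertias[k] = float(model.inertia_)

for k in distortions:
    print(k, round(distortions[k], 1), round(inertias[k], 1))
```

On this toy data the curve drops sharply up to k=3 and flattens afterwards, which is the "elbow" the visualizer looks for.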
# Silhouette Scores on Scaled Data
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize = (15,8))
ax = [ax1, ax2, ax3, ax4]
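for i in range(2,6):
    # Fit K-means with i clusters and draw its silhouette plot on its own axis
    modelKM = KMeans(n_clusters = i)
    silViz = SilhouetteVisualizer(modelKM, ax=ax[i-2])
    silViz.fit(sData)
    # Put the mean silhouette score in the panel title
    txtx = 'Silhouette Score for ' + str(i) + ' clusters: '+ str(round(sil_score(sData, modelKM.labels_), 3))
    ax[i-2].set_title(txtx)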
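The score in each panel title is the mean of the per-sample silhouette coefficients s(i) = (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below is a small NumPy implementation of that definition, not the scikit-learn code. The function name and the toy points are my own.

```python
import numpy as np

def silhouette_samples_np(X, labels):
    """Per-sample silhouette coefficients s(i) = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a sample alone in its cluster scores 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (made up): two tight groups that are far apart
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])   # groups match the geometry
bad = np.array([0, 1, 0, 1])    # groups cut across the geometry

score_good = silhouette_samples_np(X_toy, good).mean()
score_bad = silhouette_samples_np(X_toy, bad).mean()
print(score_good, score_bad)
```

A score near 1 means samples sit much closer to their own cluster than to the nearest other one, and a negative score suggests many samples would fit better in a different cluster.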
Performing K-means with 4 clusters, the number chosen with the silhouette method.
modelKM = KMeans(n_clusters = 4)
modelKM.fit(sData)
sPredKM = pd.Series(modelKM.labels_)
#Add the predicted data into a new dataframe
dataKM = data_ini.copy()
dataKM['Cluster'] = sPredKM
dataKM
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 | 0 |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 | 1 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 | 1 |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 | 0 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162 | Vanuatu | 29.2 | 46.6 | 5.25 | 52.7 | 2950 | 2.62 | 63.0 | 3.50 | 2970 | 1 |
| 163 | Venezuela | 17.1 | 28.5 | 4.91 | 17.6 | 16500 | 45.90 | 75.4 | 2.47 | 13500 | 1 |
| 164 | Vietnam | 23.3 | 72.0 | 6.84 | 80.2 | 4490 | 12.10 | 73.1 | 1.95 | 1310 | 1 |
| 165 | Yemen | 56.3 | 30.0 | 5.18 | 34.4 | 4480 | 23.60 | 67.5 | 4.67 | 1310 | 0 |
| 166 | Zambia | 83.1 | 37.0 | 5.89 | 30.9 | 3280 | 14.00 | 52.0 | 5.40 | 1460 | 0 |
167 rows × 11 columns
# Set the categories per group: relabel the clusters by size
# (largest cluster -> 2, second largest -> 1, third -> 3, smallest -> 4)
p = dataKM.Cluster.value_counts()
size_map = {p.index[0]: 2, p.index[1]: 1, p.index[2]: 3, p.index[3]: 4}
# A single map() replaces the chained assignments, which raise SettingWithCopyWarning
dataKM['Cluster'] = dataKM['Cluster'].map(size_map)
dataKM
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 | 1 |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 | 2 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 | 2 |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 | 1 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162 | Vanuatu | 29.2 | 46.6 | 5.25 | 52.7 | 2950 | 2.62 | 63.0 | 3.50 | 2970 | 2 |
| 163 | Venezuela | 17.1 | 28.5 | 4.91 | 17.6 | 16500 | 45.90 | 75.4 | 2.47 | 13500 | 2 |
| 164 | Vietnam | 23.3 | 72.0 | 6.84 | 80.2 | 4490 | 12.10 | 73.1 | 1.95 | 1310 | 2 |
| 165 | Yemen | 56.3 | 30.0 | 5.18 | 34.4 | 4480 | 23.60 | 67.5 | 4.67 | 1310 | 1 |
| 166 | Zambia | 83.1 | 37.0 | 5.89 | 30.9 | 3280 | 14.00 | 52.0 | 5.40 | 1460 | 1 |
167 rows × 11 columns
#Set label to the predicted data
cat = {1: 'Undeveloped', 2: 'Developing', 3: 'Developed', 4: 'Well Developed'}
dataKM['Cluster'] = dataKM['Cluster'].replace(cat)  # assign back instead of inplace on an attribute
dataKM
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 | Undeveloped |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 | Developing |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 | Developing |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 | Undeveloped |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 | Developing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162 | Vanuatu | 29.2 | 46.6 | 5.25 | 52.7 | 2950 | 2.62 | 63.0 | 3.50 | 2970 | Developing |
| 163 | Venezuela | 17.1 | 28.5 | 4.91 | 17.6 | 16500 | 45.90 | 75.4 | 2.47 | 13500 | Developing |
| 164 | Vietnam | 23.3 | 72.0 | 6.84 | 80.2 | 4490 | 12.10 | 73.1 | 1.95 | 1310 | Developing |
| 165 | Yemen | 56.3 | 30.0 | 5.18 | 34.4 | 4480 | 23.60 | 67.5 | 4.67 | 1310 | Undeveloped |
| 166 | Zambia | 83.1 | 37.0 | 5.89 | 30.9 | 3280 | 14.00 | 52.0 | 5.40 | 1460 | Undeveloped |
167 rows × 11 columns
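The category names above are attached according to cluster size, which works for this run but depends on how many countries land in each cluster. One possible alternative, which is not what this notebook does, is to order the clusters by the mean of a development indicator and name them by rank. The sketch below uses `income` as that indicator, which is my own assumption, and a made-up toy frame in place of `dataKM`.

```python
import pandas as pd

# Toy stand-in for dataKM (values are made up for illustration only)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'income':  [900, 1200, 9000, 11000, 40000, 80000],
    'Cluster': [2, 2, 0, 0, 3, 1],   # arbitrary ids, as returned by KMeans
})

names = ['Undeveloped', 'Developing', 'Developed', 'Well Developed']

# Order the cluster ids from lowest to highest mean income
order = toy.groupby('Cluster')['income'].mean().sort_values().index

# Map each cluster id to the category name at its rank
id_to_name = dict(zip(order, names))
toy['Category'] = toy['Cluster'].map(id_to_name)
print(toy)
```

With this approach the names follow the data rather than the cluster sizes, so they should stay meaningful if a different random initialization changes how many countries fall into each cluster.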
After the countries have been clustered, the results are visualized on a world map.
fig = px.choropleth(dataKM,
                    locationmode='country names',
                    locations='country',
                    color='Cluster',
                    color_discrete_map = {'Undeveloped':'#DB1C18', 'Developing':'#EBB331',
                                          'Developed':'#67E232', 'Well Developed':'#51A2DB'},
                    title='Countries by category'
                    )
fig.show()
Most of the Undeveloped countries seem to be on the African continent.
px.choropleth(data_frame = dataKM, locationmode = 'country names', locations = 'country',
color = dataKM.Cluster, title = 'Countries in African continent',
color_discrete_map = {'Undeveloped':'#DB1C18', 'Developing':'#EBB331',
'Developed':'#67E232', 'Well Developed':'#51A2DB'},
projection='equirectangular', scope = 'africa')
Some of the Undeveloped countries seem to be on the Asian continent.
px.choropleth(data_frame = dataKM, locationmode = 'country names', locations = 'country',
color = dataKM.Cluster, title = 'Countries in Asian continent',
color_discrete_map = {'Undeveloped':'#DB1C18', 'Developing':'#EBB331',
'Developed':'#67E232', 'Well Developed':'#51A2DB'},
scope = 'asia')
List of undeveloped countries
focusData = dataKM[dataKM.Cluster == 'Undeveloped']
focusData['country']
0                   Afghanistan
3                        Angola
17                        Benin
25                 Burkina Faso
26                      Burundi
28                     Cameroon
31     Central African Republic
32                         Chad
36                      Comoros
37             Congo, Dem. Rep.
38                  Congo, Rep.
40                Cote d'Ivoire
49            Equatorial Guinea
50                      Eritrea
55                        Gabon
56                       Gambia
59                        Ghana
63                       Guinea
64                Guinea-Bissau
66                        Haiti
72                         Iraq
80                        Kenya
81                     Kiribati
84                          Lao
87                      Lesotho
88                      Liberia
93                   Madagascar
94                       Malawi
97                         Mali
99                   Mauritania
106                  Mozambique
108                     Namibia
112                       Niger
113                     Nigeria
116                    Pakistan
126                      Rwanda
129                     Senegal
132                Sierra Leone
137                South Africa
142                       Sudan
147                    Tanzania
149                 Timor-Leste
150                        Togo
155                      Uganda
165                       Yemen
166                      Zambia
Name: country, dtype: object
I can conclude that the countries most in need of aid are mainly on the African continent, with some on the Asian continent. To get this result, I used the K-means algorithm, a widely used clustering algorithm that divides data into a chosen number of groups. Before running it, I used the Elbow and Silhouette methods to find the optimal number of clusters. The Elbow method plots the distortion score (the sum of squared distances from each point to its assigned cluster center) as a function of the number of clusters and picks the elbow of the curve as the number of clusters to use. The Silhouette method provides a succinct graphical representation of how well each object fits its assigned cluster. Both methods were used because they give complementary information for clustering analysis. For this data, both methods suggested the same number of clusters.